A combination of DHTs and Peer Clustering for Distributed Information Retrieval
Authors
Abstract
Distributed Hash Tables (DHTs) are very efficient for querying based on key lookups, provided that each individual peer has to register only a small number of keys. However, building huge term indexes, as required for IR-style keyword search, is impractical with plain DHTs. Because document term vocabularies are large, each joining peer causes a huge number of key inserts and, subsequently, a large number of index maintenance messages. Thus, the key to exploiting DHTs for distributed information retrieval is to reduce index maintenance. We show that this can be achieved by combining DHTs with peer clustering. Peers are first clustered into communities, each of which has a representative super-peer. Then all occurrences of a term in a community are published to the global DHT in a single batch by the representative super-peer. Our evaluation shows that this reduces index maintenance cost by an order of magnitude, while still keeping a complete and correct term index for query processing.
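The batching idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy peers, and the community assignment are all hypothetical, and the clustering step itself is assumed to have already happened. The sketch only counts how many inserts reach the global DHT when every peer publishes its own terms, compared with one batched insert per (community, term) issued by a super-peer.

```python
from collections import defaultdict


def plain_dht_inserts(peers):
    """Plain DHT: every peer inserts each of its terms itself,
    so the cost is one insert per (peer, term) pair."""
    return sum(len(terms) for terms in peers.values())


def clustered_batches(peers, community_of):
    """Clustered variant: the super-peer of each community publishes
    one batch per (community, term), listing all member peers that
    hold the term. The index stays complete because no peer is dropped."""
    batches = defaultdict(set)  # (community, term) -> peer ids
    for peer, terms in peers.items():
        for term in terms:
            batches[(community_of[peer], term)].add(peer)
    return dict(batches)


# Hypothetical toy data: four peers, two communities.
peers = {
    "p1": {"dht", "index", "peer"},
    "p2": {"dht", "index", "query"},
    "p3": {"dht", "cluster"},
    "p4": {"dht", "cluster", "query"},
}
community_of = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}

plain_cost = plain_dht_inserts(peers)
batches = clustered_batches(peers, community_of)
batched_cost = len(batches)

print(plain_cost, batched_cost)
```

In this toy setup the saving comes entirely from vocabulary overlap inside a community: the more terms the peers of a community share, the fewer batched inserts are needed. The numbers produced here are illustrative only and say nothing about the order-of-magnitude result reported in the evaluation.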
Related resources
Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing
Advanced applications for Distributed Hash Tables (DHTs), such as Peer-to-Peer Information Retrieval, require a DHT to quickly and efficiently process a large number (in the order of millions) of requests. In this paper we study mechanisms to optimize the throughput of DHTs. Our goal is to maximize the number of route operations per peer per second a DHT can perform (given certain constraints o...
Aggregation of a Term Vocabulary for Peer-to-Peer Information Retrieval: a DHT Stress Test
There has been an increasing research interest in developing full-text retrieval based on peer-to-peer (P2P) technology. So far, these research efforts have largely concentrated on efficiently distributing an index. However, ranking of the results retrieved from the index is a crucial part in information retrieval. To determine the relevance of a document to a query, ranking algorithms use coll...
A Tabu-Based Cache to Improve Range Queries on Prefix Trees
Distributed Hash Tables (DHTs) provide the substrate to build large-scale distributed applications over Peer-to-Peer networks. A major limitation of DHTs is that they only support exact-match queries. In order to offer range queries over a DHT it is necessary to build additional indexing structures. Prefix-based indexes, such as Prefix Hash Tree (PHT), are interesting approaches for building dis...
PCIR: Combining DHTs and peer clusters for efficient full-text P2P indexing
Distributed hash tables (DHTs) are very efficient for querying based on key lookups. However, building huge term indexes, as required for IR-style keyword search, poses a scalability challenge for plain DHTs. Due to the large sizes of document term vocabularies, peers joining the network cause huge amounts of key inserts and, consequently, a large number of index maintenance messages. Thus, the...
Aggregation of a Term Vocabulary for P2P-IR: A DHT Stress Test
There has been an increasing research interest in developing full-text retrieval based on peer-to-peer (P2P) technology. So far, these research efforts have largely concentrated on efficiently distributing an index. However, ranking of the results retrieved from the index is a crucial part in information retrieval. To determine the relevance of a document to a query, ranking algorithms use coll...